Releases: Mozilla-Ocho/llamafile
llamafile v0.4.1
llamafile lets you distribute and run LLMs with a single file
If you had trouble generating filenames while following the "bash one-liners"
blog post with the latest release, please try again.
- 0984ed8 Fix regression with --grammar flag
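As a quick sanity check, here is a sketch of the kind of one-liner the blog post describes (the model filename and grammar below are illustrative, not taken from these notes):

```sh
# Constrain generation to a plausible filename using a GBNF grammar.
./llamafile -m mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --grammar 'root ::= [a-z0-9-]+ ".txt"' \
  --silent-prompt \
  -p 'Suggest a good filename for my notes about quantum computing: '
```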
Crashes on older Intel / AMD systems should be fixed:
- 3490afa Fix SIGILL on older Intel/AMD CPUs w/o F16C
The OpenAI API compatible endpoint has been improved.
- 9e4bf29 Fix OpenAI server sampling w.r.t. temp and seed
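For reference, a request that exercises those parameters might look like this (a sketch assuming `llamafile-server` is listening on the default port 8080; the model name is just a placeholder):

```sh
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "LLaMA_CPP",
    "temperature": 0.7,
    "seed": 42,
    "messages": [{"role": "user", "content": "Say this is a test."}]
  }'
```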
This release improves the documentation.
- 5c7ff6e Improve llamafile manual
- 658b18a Add WSL CUDA to GPU section (#105)
- 586b408 Update README.md so links and curl commands work (#136)
- a56ffd4 Update README to clarify Darwin kernel versioning
- 47d8a8f Fix README changing SSE3 to SSSE3
- 4da8e2e Fix README examples for certain UNIX shells
- faa7430 Change README to list Mixtral Q5 (instead of Q3)
- 6b0b64f Fix CLI README examples
We're making strides toward automating our testing process.
Some other improvements:
- 9e972b2 Improve README examples
- 9de5686 Support bos token in llava-cli
- 3d81e22 Set logger callback for Apple Metal
- 9579b73 Make it easier to override CPPFLAGS
Our .llamafiles on Hugging Face have been updated to incorporate these
new release binaries. You can redownload here:
- https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main
- https://huggingface.co/jartine/Mistral-7B-Instruct-v0.2-llamafile/tree/main
- https://huggingface.co/jartine/wizardcoder-13b-python/tree/main
- https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile
Known Issues
LLaVA image processing with the built-in tinyBLAS library may be slow on Windows.
Here's the workaround for using the faster NVIDIA cuBLAS library instead:
- Delete the `.llamafile` directory in your home directory.
- Install CUDA
- Install MSVC
- Open the "x64 MSVC command prompt" from Start
- Run llamafile there for the first invocation.
There's a YouTube video tutorial on doing this here: https://youtu.be/d1Fnfvat6nM?si=W6Y0miZ9zVBHySFj
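In terms of commands, the workaround looks roughly like this (a sketch run from that MSVC command prompt after installing CUDA and MSVC; filenames are illustrative):

```bat
rem Remove the cached GPU support objects so llamafile rebuilds them against cuBLAS.
rmdir /s /q "%USERPROFILE%\.llamafile"
rem The first run compiles the CUDA module; -ngl 35 offloads layers to the GPU.
llamafile.exe -m llava-v1.5-7b-Q4_K.gguf --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf ^
  --image lemurs.jpg -ngl 35 -p "Describe this image."
```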
llamafile v0.4
llamafile lets you distribute and run LLMs with a single file
This release features Mixtral support. Support has been added for Qwen
models too. The `--chatml`, `--samplers`, and other flags have been added.
- 820d42d Synchronize with llama.cpp upstream
GPU now works out of the box on Windows. You still need to pass the
`-ngl 35` flag, but you're no longer required to install CUDA/MSVC.
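For example, a sketch of GPU inference on Windows with this release (the model filename and prompt are illustrative):

```sh
# -ngl 35 offloads 35 layers to the GPU; no CUDA/MSVC install is needed for tinyBLAS.
llamafile -m mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -ngl 35 \
  -p '[INST]Why is the sky blue?[/INST]'
```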
- a7de00b Make tinyBLAS go 95% as fast as cuBLAS for token generation (#97)
- 9d85a72 Improve GEMM performance by nearly 2x (#93)
- 72e1c72 Support CUDA without cuBLAS (#82)
- 2849b08 Make it possible for CUDA to extract prebuilt DSOs
Additional fixes and improvements:
- c236a71 Improve markdown and syntax highlighting in server (#88)
- 69ec1e4 Update the llamafile manual
- 782c81c Add SD ops, kernels
- 93178c9 Polyfill $HOME on some Windows systems
- fcc727a Write log to /dev/null when main.log fails to open
- 77cecbe Fix handling of characters that span multiple tokens when streaming
Our .llamafiles on Hugging Face have been updated to incorporate these
new release binaries. You can redownload them using the links listed in the v0.4.1 release above.
llamafile v0.3
llamafile lets you distribute and run LLMs with a single file
The `llamafile-main` and `llamafile-llava-cli` programs have been
unified into a single command named `llamafile`. Man pages now exist in
PDF, troff, and PostScript format. There's much better support for shell
scripting, thanks to a new `--silent-prompt` flag. It's now possible to
shell script vision models like LLaVA using grammar constraints.
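For instance, a sketch of the kind of shell scripting this enables (filenames are illustrative; the grammar forces a yes/no answer):

```sh
./llamafile -m llava-v1.5-7b-Q4_K.gguf \
  --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
  --image lemurs.jpg \
  --silent-prompt \
  --grammar 'root ::= "yes" | "no"' \
  -p 'Is there a lemur in this image? Answer yes or no.'
```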
- d4e2388 Add --version flag
- baf216a Make ctrl-c work better
- 762ad79 Add `make install` build rule
- 7a3e557 Write man pages for all commands
- c895a44 Remove stdout logging in llava-cli
- 6cb036c Make LLaVA more shell script friendly
- 28d3160 Introduce --silent-prompt flag to main
- 1cd334f Allow --grammar to be used on --image prompts
The OpenAI API in `llamafile-server` has been improved.
- e8c92bc Make OpenAI API `stop` field optional (#36)
- c1c8683 Avoid bind() conflicts on port 8080 w/ server
- 8cb9fd8 Recognize cache_prompt parameter in OpenAI API
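A request taking advantage of those changes might look like this (a sketch; `cache_prompt` is a llama.cpp server extension accepted alongside the standard OpenAI fields, and the `stop` field is simply omitted):

```sh
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "LLaMA_CPP",
    "cache_prompt": true,
    "messages": [{"role": "user", "content": "Summarize llamafile in one sentence."}]
  }'
```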
Performance regressions have been fixed for Intel and AMD users.
- 73ee0b1 Add runtime dispatching for Q5 weights
- 36b103e Make Q2/Q3 weights go 2x faster on AMD64 AVX2 CPUs
- b4dea04 Slightly speed up LLaVA runtime dispatch on Intel
The `zipalign` command is now feature complete.
- 76d47c0 Put finishing touches on zipalign tool
- 7b2fbcb Add support for replacing zip files to zipalign
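For context, here's a sketch of the packaging workflow `zipalign` supports (filenames are illustrative, and the flags shown are assumptions based on the project's documented usage rather than part of these notes):

```sh
# Start from the bare llamafile binary and embed weights plus default arguments.
cp /usr/local/bin/llamafile mistral.llamafile
cat > .args <<'EOF'
-m
mistral-7b-instruct-v0.2.Q4_K_M.gguf
...
EOF
# -j0 is assumed to store the files uncompressed so the weights can be memory-mapped.
zipalign -j0 mistral.llamafile mistral-7b-instruct-v0.2.Q4_K_M.gguf .args
```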
Some additional improvements:
- 5f69bb9 Add SVG logo
- cd0fae0 Make memory map loader go much faster on MacOS
- c8cd8e1 Fix output path in llamafile-quantize
- dd1e0cd Support attention_bias on LLaMA architecture
- 55467d9 Fix integer overflow during quantization
- ff1b437 Have makefile download cosmocc automatically
- a7cc180 Update grammar-parser.cpp (#48)
- 61944b5 Disable pledge on systems with GPUs
- ccc377e Log cuda build command to stderr
Our .llamafiles on Hugging Face have been updated to incorporate these new release binaries. You can redownload here:
- https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main
- https://huggingface.co/jartine/mistral-7b.llamafile/tree/main
- https://huggingface.co/jartine/wizardcoder-13b-python/tree/main
If you have a slower Internet connection and don't want to re-download, then you don't have to! Instructions are here:
llamafile v0.2.1
llamafile lets you distribute and run LLMs with a single file. See our README file for documentation and to learn more.
Changes
- 95703b6 Fix support for old Intel CPUs
- 401dd08 Add OpenAI API compatibility to server
- e5c2315 Make server open tab in browser on startup
- 865462f Cherry pick StableLM support from llama.cpp
- 8f21460 Introduce pledge() / seccomp security to llama.cpp
- 711344b Fix server so it doesn't consume 100% cpu when idle
- 12f4319 Add single-client multi-prompt support to server
- c64989a Add --log-disable flag to server
- 90fa20f Fix typical sampling (#4261)
- e574488 `reserve` space in `decode_utf8`
- 481b6a5 Look for GGML DSO before looking for NVCC
- 41f243e Check for i/o errors in httplib read_file()
- ed87fdb Fix uninitialized variables in server
- c5d35b0 Avoid CUDA assertion error with some models
- c373b5d Fix LLaVA regression for square images
- 176e54f Fix server crash when prompt exceeds context size
Example Llamafiles
Our .llamafiles on Hugging Face have been updated to incorporate these new release binaries. You can redownload here:
- https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main
- https://huggingface.co/jartine/mistral-7b.llamafile/tree/main
- https://huggingface.co/jartine/wizardcoder-13b-python/tree/main
If you have a slower Internet connection and don't want to re-download, then you don't have to! Instructions are here:
llamafile v0.2
Warning: This release was rolled back due to a Windows breakage caused by jart/cosmopolitan@7b3d7ee. Please use llamafile v0.2.1.
llamafile v0.1
llamafile lets you distribute and run LLMs with a single file. This is our first release! See our README file for documentation.